Universum Inference and Corpus Homogeneity

Authors

  • Carl Vogel
  • Gerard Lynch
  • Jerom Janssen
Abstract

Universum Inference is re-interpreted for the assessment of corpus homogeneity in computational stylometry. Recent stylometric research quantifies strength of characterization within dramatic works by assessing the homogeneity of corpora associated with dramatic personas. A methodological advance is suggested to mitigate the potential for the assessment of homogeneity to be achieved by chance. A baseline comparison analysis is constructed for contributions to debates by nonfictional participants: the corpus analyzed consists of transcripts of US Presidential and Vice-Presidential debates from the 2000 election cycle. The corpus is also analyzed in translation to Italian, Spanish and Portuguese. Adding randomized categories makes assessments of homogeneity more conservative.

Computational Linguistics Group, Intelligent Systems Laboratory, O'Reilly Institute, Trinity College, Dublin 2, Ireland. E-mail: vogel,gplynch,[email protected]. This work is supported by Science Foundation Ireland RFP 05/RF/CMS002.

1 Background & Method

Recent research in text classification has applied the assessment of corpus homogeneity to strength of characterization within fictional work [8]. The idea is that a character within a play is a strong character if the text associated with the character is homogeneous and distinct from that of other characters: the character is strong if a random sample of text of that character is more like the rest of that character's text than it is like the text of other characters. Another possibility is that random samples of a character's text are reliably most similar to its play, at least, if not its character. A playwright whose characters "find their author" in this sense, but not their characters or play, while still highly individual as an author, does not construct strong characters. One goal of this paper is to provide a baseline for comparison in which the contributions of individual characters are not scripted by a single author, but whose contributions have to be understood in light of each other's statements, like dialog: we assess homogeneity of contributions to national election debates.

Another focus of this work is an attempt to improve the methodology for assessing the homogeneity of corpora. The method is related to inference with the universum in machine learning. Random data drawn from the same probability space as the corpora under consideration are placed alongside the actual corpora and the categories within them. Inference with the universum involves approaching classification tasks by supplementing data sets with data points that are not actually part of the categories from which a system is choosing, but which are realistic given the features under consideration [9]. The supplemental data sets, if well chosen, can sharpen the distinction between categories, making it possible to reclassify data points that otherwise fall in between categories. Part of the idea is that clashes with the universum should be maximized. Research in this area includes focus on how best to choose the universum [6]. One can use the same sort of reasoning to quantify the homogeneity of the categories in terms of their propensity to be confused with the universum material.
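To make the use of random data concrete, the following is a minimal Python sketch of how universum files might be fabricated by sampling from the concatenation of the entire actual data set, as described for the experiments below. The function name and character-level sampling scheme are illustrative assumptions, not the authors' code.

```python
import random

def make_universum_files(real_texts, n_files, file_len, seed=0):
    """Fabricate 'universum' files: random character sequences drawn
    from the same probability space as the real corpus, here by
    sampling letter unigrams from the concatenation of all real texts.
    (A sketch; the name and exact sampling scheme are assumptions.)"""
    rng = random.Random(seed)
    pool = "".join(real_texts)  # concatenation of the entire data set
    return ["".join(rng.choice(pool) for _ in range(file_len))
            for _ in range(n_files)]
```

Files fabricated this way are then treated as one more category alongside the real ones when classification is attempted.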
As corpus homogeneity is assessed in part by rank similarity of files within it, the effect of adding random data is to diffuse the homogeneity of texts within a category, since it is increasingly likely that randomly constructed data files will be the most similar to some of the actual texts. Thus, a category that is assessed as significantly homogeneous even with the addition of random data can be judged with greater confidence, with a reduction of the possibility of type I error. In §2 we apply our methods to assess the homogeneity of debate contributions of the main contributors to the US national election debates from 2000.² Surprisingly, the transcripts do not reveal Bush or Gore to have provided self-homogeneous contributions in the sense used here (if they had been characters in a play, they would not have been among the strong characters). Adding fabricated contributions, drawn by random sampling from the concatenation of the entire actual data set, alters the outcome by weakening some of the rank similarity measures within actual categories. The second experiment individuates the same corpus into a larger number of smaller categories: the categories are individuated by speaker and date, rather than aggregating across the debates. Then universum data is added. Finally, using translations of the debates into Italian, Portuguese and Spanish, we turn the problem into one of language classification. On inspecting the texts of Bush vs. those of Gore, one might not think them as distinct from each other as texts of Italian are from those of Spanish. Whatever one's prior expectations about the homogeneity of categories individuated by speaker, there are very clear intuitions about categorization by language (and the effectiveness of letter distribution analysis in underpinning language classification generally [1]). Thus, we are able to use the universum method to enhance the assessment of homogeneity in general instances of text classification problems, as well as in computational stylometry.

The classification method used here involves several stages of analysis. A corpus of text is split into files indexed by categories. Files are balanced by size. In any one sub-experiment, the number of files in each category considered is balanced. Experiments are repeated hundreds of times, and average results are analyzed. The first stage is to compute the pairwise similarity of all of the files in the sub-experiment. Similarity is based on n-gram frequency distributions, for whatever level of tokenization is settled upon and whatever value of n [7]. In the experiments reported here, we use letter unigrams. Their efficacy in linguistic classification tasks is perhaps surprising, but they have repeatedly proven themselves [8], and they perform well with respect to word-level tokenization [5]. However, other levels of tokenization are obviously also effective. An advantage of letter unigrams is that there is no disputing their individuation, which makes it very easy to replicate experiments based on them. This is important if the text classification task involves authorship attribution for forensic purposes [2].

² The presidential debates occurred on October 3, 2000, October 11, 2000, and October 17, 2000. The Vice-Presidential debate occurred on October 5, 2000. The transcript source was http://www.debates.org/ (last verified June 2008).
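A letter-unigram frequency profile of the sort underlying the pairwise similarity computation can be sketched in a few lines. The tokenization details here (case folding, alphabetic characters only) are assumptions; the paper commits only to letter unigrams.

```python
from collections import Counter

def unigram_profile(text):
    """Letter-unigram relative frequency distribution for one file.
    Case folding and the restriction to alphabetic characters are
    illustrative assumptions, not specified in the paper."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    if total == 0:
        return {}
    return {ch: n / total for ch, n in Counter(letters).items()}
```

Profiles of this kind feed the pairwise similarity computation described next.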
The similarity metric used is the chi-by-degrees-of-freedom statistic suggested in the past by Kilgarriff for the calculation of corpus homogeneity, using word-level tokenization [3]. This essentially means calculating the χ² statistic for each token in the pair of files under consideration, and averaging it over the total number of tokens considered. Normally, χ² is used in inferential statistics to assess whether two distributions are significantly different; here, however, we use the value in the other direction, as a measure of similarity.

With all pairs of files evaluated for their similarity, files within categories can be ranked for their overall similarity. For each file in a category, the Mann-Whitney rank ordering statistic is used to assess the goodness of fit of the file with respect to its own category (its a priori category), and with respect to all other categories under consideration, on the basis of the ranks of pairwise similarity scores. The best-fit alternative categories are recorded.

Homogeneity of a category of files is measured with a Bernoulli schema. This is akin to tossing a coin in repeated experiments to assess whether the coin is fair. Here, the coin has c sides, one for each category that could be the best fit for a file. In any one fairness experiment, the c-sided coin is tossed n times, once for each file in the category. With hundreds of n-toss experiments, it is possible to assess whether the coin is fair: when the same side comes up often enough relative to those parameters, it is safe to reject the hypothesis that the coin is fair (that the category of files is randomly self-similar) and to accept that the category is significantly homogeneous.
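The two quantitative steps just described can be illustrated with a short Python sketch: the chi-by-degrees-of-freedom similarity computed over letter unigrams, and the Bernoulli-schema significance check. This is a reconstruction from the description above, not the authors' code; case folding, the restriction to alphabetic characters, and the exact form of the null-hypothesis test are assumptions.

```python
from collections import Counter
from math import comb

def chi_by_dof(text_a, text_b):
    """Chi-by-degrees-of-freedom similarity over letter unigrams:
    the chi-squared statistic for each token type in the pair of
    files, averaged over the number of types considered.  Lower
    values mean greater similarity.  Assumes both files are non-empty."""
    ca = Counter(ch for ch in text_a.lower() if ch.isalpha())
    cb = Counter(ch for ch in text_b.lower() if ch.isalpha())
    na, nb = sum(ca.values()), sum(cb.values())
    types = set(ca) | set(cb)
    chi2 = 0.0
    for t in types:
        joint = ca[t] + cb[t]
        ea = na * joint / (na + nb)  # expected count of t in file A
        eb = nb * joint / (na + nb)  # expected count of t in file B
        chi2 += (ca[t] - ea) ** 2 / ea + (cb[t] - eb) ** 2 / eb
    return chi2 / len(types)  # average over token types

def category_is_homogeneous(hits, n, c, alpha=0.05):
    """Bernoulli-schema check: the c-sided coin is tossed n times
    (once per file in the category); `hits` is how often the a priori
    category came up as best fit.  Reject the fair-coin hypothesis
    (random self-similarity) when that many hits is improbable."""
    p = 1.0 / c  # chance a fair c-sided coin shows the a priori side
    tail = sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(hits, n + 1))
    return tail < alpha
```

For instance, with c = 6 candidate categories and n = 10 files, five or more best-fit hits on the a priori category has a binomial tail probability of roughly 0.015, so the category would be judged significantly homogeneous at the 0.05 level.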




Published: 2008